Search Results: "srivasta"

22 December 2007

Manoj Srivastava: Manoj: The way of the wolf

This is a sci-fi series from E. E. Knight. It is not often that one comes across a brand new series in fantasy or science fiction, and more rarely still one with the quality of this one (the Vampire Earth series). This is a post-apocalyptic novel, the apocalypse being a virus that killed most of the human population, unleashed by a gate-travelling extra-solar species to disrupt human resistance as they took over. Ostensibly about vampires, it provides an interesting back story to explain master vampires and their reaper thralls. What was captivating about this book is the detailed and generally coherent world building: the swaths of land under outsider control, where there is law and order and culling of humans for food, and the ragtag resistance. The characters are fairly well developed (though the author shies away from romantic relationships of any kind). Not since the Recluce novels have I felt this way about a new series.

25 November 2007

Manoj Srivastava: Manoj: Filtering Accuracy: Brown paper bag time

After posting about filtering accuracy I got to thinking about the test I was using. It appeared to me that there should be no errors in the mails that crm114 had already been trained upon, but here I was, coming up with errors: I trained the css files until there were no errors, and then used a regexp that tried to find the accuracy of classification for all the files, not just new ones. This did not make sense. The only explanation was that my css files were not properly created, so I thought to try an experiment where, instead of trying to throw my whole corpus as one chunk at a blank css file, I would feed the corpus in chunks. I came up with an initialization script to feed my corpus to a blank css file in 200-mail chunks and, while it was at it, renumber the corpus mails (over the years, as I cleaned the corpus, gaps had appeared in the numbering). I have also updated the retraining script. Voila: I am getting a new set of css files which do not appear to show any errors for mails crm114 has already learned about; in other words, for mails it has seen, the accuracy is now 100%, not 99.5% as it was a couple of days ago. While it is good news that my classification accuracy is better than it was last week, the bad news is that I no longer have a concrete number on accuracy for crm114: the mechanism used now gives 100% accuracy all the time.

The funny thing is, I recall going through this analysis a couple of years ago, where I predicted that one could only check for accuracy with a test corpus that had the same characteristics as real mail inflow, and which had not been used for learning. That would mean I would have a classified testing corpus that could improve the efficiency of my filter, but was not being used to provide accuracy numbers. I have gone for improving the filter, at the cost of knowing how accurate it actually is.
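The initialization script itself is not reproduced here; as a rough illustration of the chunked-feeding idea, a minimal sketch follows. The corpus layout, chunk size handling, and the train_chunk command are placeholders standing in for the real crm114 training invocation, not the actual script.

#!/bin/sh
# Sketch only: renumber a one-message-per-file corpus and feed it to a blank
# css file in 200-mail chunks. "train_chunk" is a hypothetical placeholder
# for the real crm114 training command.
set -e

CORPUS=~/corpus/ham              # assumed layout: one mail per file
RENUMBERED=$CORPUS.renumbered
CHUNK=200

# Renumber the mails to close the gaps that crept in over the years.
mkdir -p "$RENUMBERED"
n=0
for f in "$CORPUS"/*; do
    n=$((n + 1))
    cp "$f" "$RENUMBERED/$(printf '%06d' "$n")"
done

# Feed the renumbered corpus to the classifier in fixed-size chunks.
set --
i=0
for f in "$RENUMBERED"/*; do
    set -- "$@" "$f"
    i=$((i + 1))
    if [ "$i" -eq "$CHUNK" ]; then
        train_chunk "$@"         # placeholder for the real training call
        set --
        i=0
    fi
done
if [ "$#" -gt 0 ]; then
    train_chunk "$@"             # train on the final, short chunk
fi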

Manoj Srivastava: Manoj: The children of men

A nicely paced movie about a bleak future, and how people cope with despair and desperate times. While it did not quite come together in the details (anything outside of England was a big unknown blur), and the London of 2027 seemed not much different from any current-day city under semi-martial law (technology, for instance, seems to have frozen at today's levels), it was still fast paced, and enjoyable; and anyway, this is not primarily a sci-fi flick. Recommended.

Manoj Srivastava: Manoj: 300, and the history channel perspective.

Yes, this is about a movie based on a comic based on a movie from the 50s. And they did a wonderful job of conveying the comic book feel; and yet, though you could appreciate the abstract, stylized presentation of the comic, most of the movie still came straight from Herodotus. The training of the Spartans, the throwing of the Persian emissaries into a pit and a well: this cleaving to the historic details was a pleasant surprise. The History Channel presentation is recommended for the perspective it brings to the tale. There were some poetic licenses: the whole bit about a highly placed Spartan traitor was made out of whole cloth; and the current conventional wisdom is that Leonidas went to Thermopylae because of his religious beliefs, and conviction about the sacred prophecy of the oracle at Delphi, not because he thought Persia would destroy Greece (remember, Xerxes won, and sacked Athens). Indeed, there was little concept of Greece at that point. Indeed, the whole schtick about the last stand at Thermopylae saving democracy seems suspect: the stand bloodied Persia's nose, and delayed them by perhaps 5 days in an advance, which the Greeks knew about, that took the better part of a year. No, it was the combination of Marathon, Thermopylae, Salamis, and Plataea over the course of half a century that ensured that the no-name David of the Greek city states survived against the Goliath of Persia. And, then, of course, came the boy wonder out of Macedonia. Highly recommended.

19 November 2007

Manoj Srivastava: Manoj: Eragon

I liked the book. Sure, it is "The Lord of the Rings meets Star Wars", but the book had a nice flow, and it was written by a fifteen-year-old, fer gawd's sake. The very fact that he can turn out a page-turner of a book when others of his age can't string together a grammatical, correctly spelled sentence is amazing. Overall, derivative, unoriginal, and simplistic though the book is, it has an original charm: a very good book for children, and one that adults can read through as well. So I went to this movie with high hopes. What a letdown. This was merely a notch above the Beowulf debacle. A lackluster, bland drudge of a movie, with all kinds of interesting elements and nuances from the book removed. Crude, unimaginative, ham-handed performances all around. The plot line, which did not follow the book, was dumbed down; there were implications that the Elven princess was a potential love interest (faugh); and the refreshing pace of the book fell off to a plodding, soporific caricature. It is an offense to the book, and to the author. I was going to point out the differences between the movie and the book, and why the differences made the movie worse, but after 30 or so items this post would have gotten to be too big. And, having written it, I have the release of the rant, so I no longer have to include it here. Anyway, Wikipedia says that the film came in at #235 in the all-time worldwide box office chart but was met with dismal critical reviews, scoring only a 16% composite score on Rotten Tomatoes. I feel sorry for you if you suffered through this, as did I.

18 November 2007

Manoj Srivastava: Manoj: The movie vaguely resembling Beowulf: an IMAX 3d experience

This should really be titled "A movie vaguely representing Beowulf, but all sexed up with various salacious elements". Hrothgar was treated much better in the original; and all the blatant and gratuitous sexuality brought into the movie was a turn-off. But then, I might be in the minority of the audience who had any familiarity with the poem. The characters in the movie seemed two-dimensional caricatures (the only compelling performance was from Grendel's mother). And the changes made to the story line also lost the prowling menace of the latter years of the king of the Geats. After watching Hollywood debacles like this one, I am driven to wonder why Hollywood writers seem to think they can so improve upon the work of writers whose story has stood the test of time. Making Beowulf into a boastful liar and cheat (even in the tale of the sea monsters, his men imply that they knew their lord was a liar) in an age where honor and battle prowess were everything: I mean, what were the producers thinking? Most certainly not a movie I am going to recommend. I had not researched the movie much before I went into the show, and it was a surprise to me to see that this was an animated movie a la "Final Fantasy", and while I was impressed with the computer graphics (reflections in general, and reflections of ripples in the water, were astounding), the not-a-cartoon-but-not-a-realistic-movie experience was a trifle distracting, and detracted from telling the tale. I like IMAX 3D, and the glasses are improving.

13 November 2007

Manoj Srivastava: Manoj: Deeds of Paksenarrion: III

Oath of Gold rounds out this excellent fantasy series from Elizabeth Moon. It is a pity that she never came back to this character (though she wrote a couple of prequels), despite the fact that the ending paragraph leaves ample room for sequels: when the call of Gird came, Paksenarrion left for other lands. This is high fantasy in the true Tolkien manner, but faster paced, more gritty, and with characters one can relate to. I am already looking forward to my next re-read of the series.

12 November 2007

Manoj Srivastava: Manoj: Deeds of Paksenarrion: II

Divided Allegiance is the middle book of the trilogy, the one that I hate reading. Not because Ms Moon's book is bad, which it is not: it is still as gripping as the others, and comes closer to the high fantasy of Tolkien. It is just that I hate what happens to Paks in the book, and the fact that the book ends leaving her in that state. I guess I am a wimp when it comes to some things that happen to characters I identify with. However, it has been so long since I read the series that I have begun to forget the details, so I went through and read it anyway. This is a transition book: the Deeds of Paksenarrion was about Paksenarrion the line warrior, and the final book is where she becomes the stuff of legends. I usually read just the first and the last.

10 November 2007

Manoj Srivastava: Manoj: The Secret Servant

I bought this book by Daniel Silva last week at SFO, faced with a long wait for the red-eye back home, since I recalled hearing about it on NPR, and reading a review in Time magazine, or was it the New Yorker? Anyway, the review said he is "his generation's finest writer of international intrigue, one of America's most gifted spy novelists ever". I guess Graham Greene and John le Carré belong to an older generation. Anyway, everything I read or heard about it was very positive. Daniel Silva is far less cynical than le Carré, and his world does not gel quite as well, to my ears, as Smiley's Circus did. The hero, Gabriel Allon, does have some superhuman traits, but, thank the lord, is not James Bond. I was impressed by Silva's geopolitics, though: paragraphs from the book seem to appear almost verbatim in current-event reports in the International Herald Tribune and BBC stories. I liked this book (to the extent of ordering another 7 from this author from Amazon today), and appreciate the influx of new blood in the international espionage market. Lately, the genre has been treated to lackluster, mediocre knock-offs of The Bourne Identity, and the engaging pace of the original has never been successfully replicated in the sequels. And Silva's writing is better than Ludlum's.

Manoj Srivastava: Manoj: Adventures in the windy city

I have just come back from a half-week stay at the Hilton Indian Lakes resort (which is the second time in a month that I have stayed at a golf resort and club and proceeded to spend 9 hours a day in a windowless conference room). On Thursday night, an ex-Chicago native wanted to show us the traditional Chicago pizza (which can be delivered, half cooked and frozen, via Fed-Ex, anywhere in the lower 48). Google Maps to the rescue! One of the attendees had a car, and we piled in and drove to the nearest pizzeria. It was take-out only. We headed to the next on the list, again to be met with disappointment, since making the pizza takes the best part of an hour, and we did not want to be standing out in a chilly parking lot while they made our pizza. So, I strongly advocated going to Tapas Valencia instead, since I had never had tapas before. Somewhat to our disappointment, they served tapas only as an appetizer, and had a limited selection; so we ended up ordering one tapas dish (I had beef kabobs with a garlic horseradish sauce and caramelized onions), and my very first paella (paella valencia), with shrimp, mussels, clams, chicken, and veggies. We ate well, and headed back to the hotel.

As we parked, and started for the gate, I realized I no longer had my wallet with me, so back to the restaurant we went. The waiter had not found the wallet. Nor had the busboy. The owner/hostess suggested perhaps it was in the parking lot? So we all went and combed the parking lot, once, twice. At this point I am beginning to think about the consequences: I can't get home, because I can't get into the airport, since I have no ID. I have no money, but Judy can't wire the money to me via Western Union because I have no ID. I need money to buy Greyhound tickets to get home on a bus, and then there is the cancelling of credit cards, etc. Panic city. While I was on my fourth circuit of the parking lot, the owner went back and checked the laundry chute. I had apparently carelessly draped the napkin over my wallet when paying the tab, and walked away, and the busboy just grabbed all the napkins, wallet and all, and dumped it down the chute. Judy suggests I carry an alternate form of ID and at least one credit card in a different location than my wallet for future trips.

If that was not excitement enough, yesterday I got on the plane home, uneventfully enough. We took off, and I was dozing comfortably, when there were two loud bangs, and the plane juddered and listed to port. There was a smell of burning rubber, and we stopped gaining altitude. After making a rough about-turn with the left wing down, the pilot came on the intercom to say "We just lost our left engine, and we are returning to O'Hare. We should be in the ground in two minutes". Hearing the "in", a guy up front started hyperventilating, and his wife was rubbing his back. My feelings were mostly of exasperation: I had just managed to get myself situated comfortably, and now lord only knows when we would get another aircraft. When we landed, the nervous dude reached over and kissed his wife like he had just escaped the jaws of death. And he asked if any of us knew statistics, and if we were fine now. (I was tempted to state that statistics are not really predictive, but hey.) It was all very pretty, with six fire engines rushing over and spraying us with foam and all. When we got off the plane the nervous dude headed straight to some chairs in the terminal, and said his legs would not carry him further. He did make it to the replacement plane later, though.
Turns out it was a bird flying into the engine that caused the flameout. Well, at least I have a story to tell, though it delayed getting home by about three hours.

8 November 2007

Manoj Srivastava: Manoj: Deeds of Paksenarrion

Sheepfarmer's Daughter is an old favourite, which I have read lord only knows how many times. Elizabeth Moon has written a gritty, enthralling story of the making of a Paladin. This is the first book of a trilogy, and introduces us to a new universe through the eyes of a young innocent (which is a great device for introducing a universe, from the viewpoint of someone who is not seeing it through eyes jaundiced by experience). For me, books have always been an escape from the humdrum mundanity of everyday existence. Putting myself in the shoes of a character in the story is the whole point, and this story excels there: it is very believable. Not many people can tell a tale that comes alive, and Ms Moon is one of them. An ex-marine, Moon has drawn much of the detail of Paks' military life from her own military experience. More than just that, the world is richly drawn, and interesting. I read this book in a hotel room in Chicago, since, as usual, there was nothing really interesting on TV, and I don't get the whole bar scene.

6 November 2007

Manoj Srivastava: Manoj: Continuous Automated Build and Integration Environment

One of the things I have been tasked to do in my current assignment is to create a dashboard of the status of the various software components created by different contractors (participating companies) in the program. These software components are built by different development groups, utilizing dissimilar toolsets, languages and tools, though I was able to get an agreement on the VCS (Subversion, yuck). Specifically, one should be able to tell which components pass pre-build checks, compile, can be installed, and pass unit and functional tests. There should be nightly builds, as well as builds whenever someone checks in code on the release branches. And, of course, the dashboard should be HTTP accessible, and be bright and, of course, shiny. My requirements were that, since the whole project is not Java, there should be no dependencies on maven or ant or eclipse projects (or make, for that matter); that it should be able to do builds on multiple machines (license constraints restrict some software to Solaris or Windows); and that it should not suck up too much time from my real job (this is a service: if it is working well, you get no credit; if it fails, you are on the hot seat). And it should be something I can easily debug, so no esoteric languages (APL, Haskell and Python :P).

So, using "continuous integration" as a Google search term, I found the comparison matrix at Damage Control. I looked at Anthill and CruiseControl, and the major drawback people seemed to think CruiseControl had (configuration is done by editing an XML file, as opposed to a by-some-accounts-buggy UI) is not much of a factor for me. (See this.) I also like the fact that it seems easier to plug in other components. I am uncomfortable with free software that has a commercial sibling; we have been burned once by UML software with those characteristics. CruiseControl, DamageControl, Tinderbox 1 and 2, Continuum, and Sin matched my requirements. I tried to see the demo versions; Sin's link led me to a site selling Myrtle Beach condos, never a good sign. Continuum and DamageControl were currently down, so I could not do an evaluation. So, here are the ones I could get to with working demo pages: http://cclive.thoughtworks.com/ and http://tinderbox.mozilla.org/showbuilds.cgi?tree=SeaMonkey

CruiseControl takes full control, checking things out of source control and running the tests, which implies that all the software builds and runs on the same machine; this is not the case for me. Also, CC needs to publish the results/logs in XML, which seems to be a good fit for the Java world but might be a constraint for my use case. I like the Tinderbox dashboard better, based on the information presented, but that is not a major issue. It also might be better suited for a distributed, open source development model; CruiseControl seems slightly more centralized (more on this below). CruiseControl is certainly more mature, and we have some experience with it. Tinderbox has a client/server model, and communicates via email with a number of machines where the actual build/testing is done. This seems good. Then there is Flamebox: a nice dashboard, a derivative of Tinderbox 2, and pretty simple (perhaps too simple) and easily modifiable.

However, none of these seemed right. There was too much of an assumption of a build-and-test model, and few of them seemed to be a good fit for distributed, Grid-based software development; so I continued looking. Cabie screen shot. I finally decided on CABIE:
Continuous Automated Build and Integration Environment. Cabie is a multi-platform, multi-cm client/server based application providing both command line and web-based access to real time build monitoring and execution information. Cabie builds jobs based upon configuration information stored in MySQL and will support virtually any build that can be called from the command line. Cabie provides a centralized collection point for all builds providing web based dynamic access, the collector is SQL based and provides information for all projects under Cabie's control. Cabie can be integrated with bug tracking systems and test systems with some effort depending on the complexity of those systems. With the idea in mind that most companies create build systems from the ground up, Cabie was designed to not have to re-write scripted builds but instead to integrate existing build scripts into a smart collector. Cabie provides rapid email notification and RSS integration to quickly handle build issues. Cabie provides the ability to run builds in parallel, series, to poll jobs or to allow the use of scripted nightly builds. Cabie is perfect for agile development in an environment that requires multiple languages and tools. Cabie supports Perforce, Subversion and CVS. The use of a backend broker allows anyone with perl skills to write support for additional CM systems.
The nice people at Yo Linux have provided a tutorial for the process. I did have to make some changes to get things working (mostly in line with the changes recommended in the tutorial, but not exactly the same). I have sent the patches upstream, but upstream is not sure how much of it they can use, since there has been major progress since the last release. Upstream is nice and responsive, and has added support in unreleased versions for using virtual machines to run the builds in (they use that to do the Solaris/Windows builds), improved the web interface using (shudder) PHP, and all kinds of neat stuff.
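None of the plumbing that actually triggers the builds is shown above. As a rough illustration only (the repository path, branch layout, and the request-build helper are all assumptions, not part of Cabie), commit-triggered and nightly builds can be wired up with a Subversion post-commit hook plus a crontab entry along these lines:

#!/bin/sh
# Sketch of hooks/post-commit in the Subversion repository. Subversion passes
# the repository path and the committed revision as the two arguments.
# "request-build" is a hypothetical script that asks the build server to run.
REPOS="$1"
REV="$2"

# Did this commit touch a release branch? (the path layout is an assumption)
if svnlook changed "$REPOS" -r "$REV" | awk '{print $2}' | grep -q '^branches/release-'; then
    /usr/local/bin/request-build "$REPOS" "$REV"
fi

# Nightly builds would then just be a crontab entry on the build master, e.g.:
# 0 2 * * * /usr/local/bin/request-build /srv/svn/repo nightly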

5 November 2007

Manoj Srivastava: Manoj: Filtering accuracy: Hard numbers

I have often posted about the accuracy of my mail filtering mechanisms on the mailing lists (I have not had a false positive in years; I stash all discards/rejects locally, and do spot checks frequently; and I went through 6 months of exhaustive checks when I put this system in place). False negatives are down to about 3-4 a month (0.019%). Yes, that is right: I am claiming that my classification correctness record is 99.92% (99.98% accuracy for messages my classifiers are sure about). Incorrectly classified unsure ham is about 3-4 (0.019%) a month; incorrectly classified unsure Spam is roughly the same, perhaps a little higher. Adding these to the incorrect classifications, my best estimate of not confidently classified mail is 0.076%, based on the last 60 days of data (which is what gets you the 99.92%). I get unsure/retrain messages at the rate of about 20 a day (about 3.2% of non-spam email), about 2/3rds of which are classified correctly, but either SA and crm114 disagree, or crm114 is unsure. So I have to look at about 20 messages a day to see if a ham message slipped in there, and train my filters based on these; the process is highly automated (it just uses my brain as a classifier). The mail statistics can be seen on my mail server. Oh, my filtering front end also switches between reject/discard and turns greylisting on and off based on whether or not the mail is coming from mailing lists/newsletters I have authorized; see my mimedefang-filter.

However, all these numbers are manually gathered, and I still have not gotten around to automating the measurement of my setup's overall accuracy; but now I have some figures for one of the two classifiers in my system. Here is the data from CRM114. I'll update the numbers below via cron. First, some context: when training CRM114 using the mailtrainer command, one can specify that a certain percentage of the training set be left out of the learn phase, and a second pass be run over the mails so skipped to test the accuracy of the training. The way you do this is by specifying a regular expression to match the file names. Since my training set has message numbers, it was simple to use the least significant two digits as a regexp; but I did not like the idea of always leaving out the same messages. So I now generate two sets of numbers for every training run, and leave out messages with those two trailing digits, in effect reserving 2% of all mails for the accuracy run.

An interesting thing to note is the asymmetry in the accuracy: CRM114 has never identified a Spam message incorrectly. This is because the training mechanism is skewed towards letting a few spam messages slip through, rather than letting a good message slip into the spam folder. I like that. So, here are the accuracy numbers for CRM114; adding Spamassassin into the mix only improves the numbers. Also, I have always felt that a freshly learned css file is somewhat brittle, in the sense that if one trains on an unsure message, and then tries to TUNE (Train Until No Errors) the css file, a large number of runs through the training set are needed until the thing stabilizes. So it is as if the learning done initially was minimalistic, and adding the information for the new unsure message required all kinds of tweaking. After a while of TOEing (Training on Errors) and TUNEing, this brittleness seems to get hammered out of the CSS files. I also expect to see accuracy rise as the css files get less brittle. The table below starts with data from a newly minted .css file.
Accuracy numbers and validation regexps
Date Corpus Ham Spam Overall Validation
  Size Count Correct Accuracy Count Correct Accuracy Count Correct Accuracy Regexp
Wed Oct 31 10:22:23 UTC 2007 43319 492 482 97.967480 374 374 100.000000 866 856 98.845270 [1][6][_][_] [0][3][_][_]
Wed Oct 31 17:32:44 UTC 2007 43330 490 482 98.367350 378 378 100.000000 868 860 99.078340 [3][7][_][_] [2][3][_][_]
Thu Nov 1 03:01:35 UTC 2007 43334 491 483 98.370670 375 375 100.000000 866 858 99.076210 [2][0][_][_] [7][9][_][_]
Thu Nov 1 13:47:55 UTC 2007 43345 492 482 97.967480 376 376 100.000000 868 858 98.847930 [1][2][_][_] [0][2][_][_]
Sat Nov 3 18:27:00 UTC 2007 43390 490 480 97.959180 379 379 100.000000 869 859 98.849250 [4][1][_][_] [6][4][_][_]
Sat Nov 3 22:38:12 UTC 2007 43394 491 482 98.167010 375 375 100.000000 866 857 98.960740 [3][1][_][_] [7][8][_][_]
Sun Nov 4 05:49:45 UTC 2007 43400 490 483 98.571430 377 377 100.000000 867 860 99.192620 [4][6][_][_] [6][8][_][_]
Sun Nov 4 13:35:15 UTC 2007 43409 490 485 98.979590 377 377 100.000000 867 862 99.423300 [3][7][_][_] [7][9][_][_]
Sun Nov 4 19:22:02 UTC 2007 43421 490 486 99.183670 379 379 100.000000 869 865 99.539700 [7][2][_][_] [9][4][_][_]
Mon Nov 5 05:47:45 UTC 2007 43423 490 489 99.795920 378 378 100.000000 868 867 99.884790 [4][0][_][_] [8][3][_][_]
As you can see, the accuracy numbers are trending up, and already are nearly up to the values observed on my production system.
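The two hold-out regexps in the last column are regenerated for every training run. A minimal sketch of how that could be done in shell (the file-name convention and the [_][_] suffix follow the table above; the actual mailtrainer invocation is omitted):

#!/bin/bash
# Sketch: pick two distinct random two-digit endings and turn them into the
# hold-out regexps shown in the table, reserving roughly 2% of a corpus whose
# file names end in the message number.
d1=$(printf '%02d' $((RANDOM % 100)))
d2=$(printf '%02d' $((RANDOM % 100)))
while [ "$d2" = "$d1" ]; do
    d2=$(printf '%02d' $((RANDOM % 100)))
done
regexp="[${d1%?}][${d1#?}][_][_] [${d2%?}][${d2#?}][_][_]"
echo "validation regexp: $regexp"
# The regexp is then handed to the training run to exclude those messages
# from the learn phase and use them for the accuracy pass.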

4 November 2007

Manoj Srivastava: Manoj: The White Company

I had somehow managed to miss out on The White Company while I was growing up and devouring all the Sherlock Holmes stories and The Lost World. This is a pity, since I would have liked this bit of the Hundred Years War much better when I was young and uncritical. Oh, I do like the book. The pacing is fast, if somewhat predictable. The book is well researched, and leads you from one historic event to the other, and is peppered with all kinds of historical figures, and I believe it to be quite authentic in its period settings. Unfortunately, there is very little character development, and though the characters are deftly sketched, they all lack depth, which would not have bothered the young me. Also, Sir John Hawkwood, of the White Company, is mentioned only briefly in passing. This compares less favourably with Walter Scott's Quentin Durward, set in a period less than 80 years in the future; but then, I've always had a weakness for Scott. As for Conan Doyle, The Lost World was far more gripping. I am now looking for books about Hawkwood, a mercenary captain mentioned in this book, as well as Dickson's Childe Cycle books. The only books I have found so far on the golden age of the Condottieri are so darned expensive.

10 October 2007

Martin F. Krafft: Packaging with Git

Introduction
I gave a joint presentation with Manoj at Debconf7 about using distributed version control for Debian packaging, and I volunteered to do an on-line workshop about using Git for the task, so it's about time that I should know how to use Git for Debian packaging, but it turns out that I don't. Or well, didn't. After I made a pretty good mess out of the mdadm packaging repository (which is not a big problem as it's just ugly history up to the point when I start to get it right), I decided to get down with the topic and figure it out once and for all. I am writing this post as I put the pieces together. It's been cooking for a week, simply so I could gather enough feedback. I am aware that Git is not exactly a showcase of usability, so I took some extra care to not add to the confusion. It may be the first post in a series, because this time, I am just covering the case of mdadm, for which upstream also uses Git and where I am the only maintainer, and I shall pretend that I am importing mdadm to version control for the first time, so there won't be any history juggling. Future posts could well include tracking Subversion repositories with git-svn, and importing packages previously tracked therewith, but this ain't no promise! (well, that last post is already being drafted, but far from finished; you have been warned!) I realise that git-buildpackage exists, but imposes a rather strict branch layout and tagging scheme, which I don't want to adhere to. And gitpkg (Romain blogged about it recently) deserves another look since, according to its author, it does not impose anything on its user. But in any case, before using such tools (and possibly extending them to allow for other layouts), I'd really rather have done it by hand a couple of times to get the hang of it and find out where the culprits lie. Now, enough of the talking, just one last thing: I expect this blog post to change quite a bit as I get feedback. Changes shall be highlighted in bold typeface.
Setting up the infrastructure
First, we prepare a shared repository on git.debian.org for later use (using collab-maint for illustration purposes), download the Debian source package we want to import (version 2.6.3+200709292116+4450e59-3 at time of writing, but I pretend it's -2 because we shall create -3 further down ), set up a local repository, and link it to the remote repository. Note that there are other ways to set up the infrastructure, but this happens to be the one I prefer, even though it's slightly more complicated:
$ ssh alioth
$ cd /git/collab-maint
$ ./setup-repository pkg-mdadm mdadm Debian packaging
$ exit
$ apt-get source --download-only mdadm
$ mkdir mdadm && cd mdadm
$ git init
$ git remote add origin ssh://git.debian.org/git/collab-maint/pkg-mdadm
$ git config branch.master.merge refs/heads/master
Now we can use git-pull and git-push, except the remote repository is empty and we can't pull from there yet. We'll save that for later. Instead, we tell the repository about upstream's Git repository. I am giving you the git.debian.org URL though, simply because I don't want the upstream repository (which lives on an ADSL line) hammered in response to this blog post:
$ git remote add upstream-repo git://git.debian.org/git/pkg-mdadm/mdadm
Since we're using the upstream branch of the pkg-mdadm repository as source (and don't want all the other mess I created in that repository), we'll first limit the set of branches to be fetched (I could have used the -t option in the above git-remote command, but I prefer to make it explicit that we're doing things slightly differently to protect upstream's ADSL line).
$ git config remote.upstream-repo.fetch \
    +refs/heads/upstream:refs/remotes/upstream-repo/upstream
And now we can pull down upstream's history and create a local branch off it. The "no common commits" warning can be safely ignored since we don't have any commits at all at that point (so there can't be any in common between the local and remote repository), but we know what we're doing, even to the point that we can forcefully give birth to a branch, which is because we do not have a HEAD commit yet (our repository is still empty):
$ git fetch upstream-repo
warning: no common commits
[ ]
  # in the real world, we'd be branching off upstream-repo/master
$ git checkout -b upstream upstream-repo/upstream
warning: You appear to be on a branch yet to be born.
warning: Forcing checkout of upstream-repo/upstream.
Branch upstream set up to track remote branch
  refs/remotes/upstream-repo/upstream.
$ git branch
* upstream
$ ls | wc -l
77
Importing the Debian package
Now it's time to import Debian's diff.gz; remember how I pretend to use version control for package maintenance for the first time. Oh, and sorry about the messy file names, but I decided it's best to stick with real data in case you are playing along: Since we're applying the diff against version 2.6.3+200709292116+4450e59, we ought to make sure to have the repository at the same state. Upstream never "released" that version, but I encoded the commit ID of the tip when I snapshotted it: 4450e59, so we branch off there. Since we are actually tracking the git.debian.org pkg-mdadm repository instead of upstream, you can use the tag I made. Otherwise you could consider tagging yourself:
$ #git tag -s mdadm-2.6.3+200709292116+4450e59 4450e59
$ git checkout -b master mdadm-2.6.3+200709292116+4450e59
$ zcat ../mdadm_2.6.3+200709292116+4450e59-2.diff.gz | git apply
The local tree is now "debianised", but Git does not know about the new and changed files, which you can verify with git-status. We will split the changes made by Debian's diff.gz across several branches.
The idea of feature branches
We could just create a debian branch, commit all changes made by the diff.gz there, and be done with it. However, we might want to keep certain aspects of Debianisation separate, and the way to do that is with feature branches (also known as "topic" branches). For the sake of this demonstration, let's create the following four branches in addition to the master branch, which holds the standard Debian files, such as debian/changelog, debian/control, and debian/rules:
  • upstream-patches will include patches against the upstream code, which I submit for upstream inclusion.
  • deb/conffile-location makes /etc/mdadm/mdadm.conf the default over /etc/mdadm.conf and is Debian-specific (thus the deb/ prefix).
  • deb/initramfs includes the initramfs hook and script, which I want to treat separately but not submit upstream.
  • deb/docs similarly includes Debian-only documentation I add to the package as a service to Debian users.
If you're importing a Debian package using dpatch, you might want to convert every dpatch into a single branch, or at least collect logical units into separate branches. Up to you. For now, our simple example suffices. Keep in mind that it's easy to merge two branches and less trivial to split one into two. Why? Well, good question. As you will see further down, the separation between master and deb/initramfs actually makes things more complicated when you are working on an issue spanning across both. However, feature branches also bring a whole lot of flexibility. For instance, with the above separation, I could easily create mdadm packages without initramfs integration (see #434934), a disk-space-conscious distribution like grml might prefer to leave out the extra documentation, and maybe another derivative doesn't like the fact that the configuration file is in a different place from upstream. With feature branches, all these issues could be easily addressed by leaving out unwanted branches from the merge into the integration/build branch (see further down). Whether you use feature branches, and how many, or whether you'd like to only separate upstream and Debian stuff is entirely up to you. For the purpose of demonstration, I'll go the more complicated way.
Setting up feature branches
So let's commit the individual files to the branches. The output of the git-checkout command shows modified files that have not been committed yet (which I trim after the first example); Git keeps these across checkouts/branch changes. Note that the ./debian/ directory does not show up as Git does not know about it yet (git-status will tell you that it's untracked, or rather: contains untracked files since Git does not track directories at all):
$ git checkout -b upstream-patches mdadm-2.6.3+200709292116+4450e59
M Makefile
M ReadMe.c
M mdadm.8
M mdadm.conf.5
M mdassemble.8
M super1.c
$ git add super1.c     #444682
$ git commit -s
  # i now branch off master, but that's the same as 4450e59 actually
  # i just do it so i can make this point 
$ git checkout -b deb/conffile-location master
$ git add Makefile ReadMe.c mdadm.8 mdadm.conf.5 mdassemble.8
$ git commit -s
$ git checkout -b deb/initramfs master
$ git add debian/initramfs/*
$ git commit -s
$ git checkout -b deb/docs master
$ git add RAID5_versus_RAID10.txt md.txt rootraiddoc.97.html
$ git commit -s
  # and finally, the ./debian/ directory:
$ git checkout master
$ chmod +x debian/rules
$ git add debian
$ git commit -s
$ git branch
  deb/conffile-location
  deb/docs
* master
  upstream
  upstream-patches
At this time, we push our work so it won't get lost if, at this moment, aliens land on the house, or any other completely plausible event of apocalypse descends upon you. We'll push our work to git.debian.org (the origin, which is the default destination and thus need not be specified) by using git-push --all, which conveniently pushes all local branches, thus including the upstream code; you may not want to push the upstream code, but I prefer it since it makes it easier to work with the repository, and since most of the objects are needed for the other branches anyway; after all, we branched off the upstream branch. Specifying --tags instead of --all pushes tags instead of heads (branches); you couldn't have guessed that! See this thread if you (rightfully) think that one should be able to do this in a single command (which is not git push refs/heads/* refs/tags/*).
$ git push --all
$ git push --tags
Done. Well, almost
Building the package (theory)
Let's build the package. There seem to be two (sensible) ways we could do this, considering that we have to integrate (merge) the branches we just created, before we fire off the building scripts:
  1. by using a temporary (or "throw-away") branch off upstream, where we integrate all the branches we have just created, build the package, tag our master branch (it contains debian/changelog), and remove the temporary branch. When a new package needs to be built, we repeat the process.
  2. by using a long-living integration branch off upstream, into which we merge all our branches, tag the branch, and build the package off the tag. When a new package comes around, we re-merge our branches, tag, and build.
Both approaches have a certain appeal to me, but I settled for the second, for two reasons, the first of which leads to the second:
  1. When I upload a package to the Debian archive, I want to create a tag which captures the exact state of the tree from which the package was built, for posterity (I will return to this point later). Since the throw-away branches are not designed to persist and are not uploaded to the archive, tagging the merging commit makes no sense. Thus, the only way to properly identify a source tree across all involved branches would be to run git-tag $branch/$tagname $branch for each branch, which is purely semantic and will get messy sooner or later.
  2. As a result of the above: when Debian makes a new stable release, I would like to create a branch corresponding to the package in the stable archive at the time, for security and other proposed updates. I could rename my throw-away branch, if it still existed, or I could create a new branch and merge all other branches, using the (semantic) tags, but that seems rather unfavourable.
So instead, I use a long-living integration branch, notoriously tag the merge commits which produced the tree from which I built the package I uploaded, and when a certain version ends up in a stable Debian release, I create a maintenance branch off the one, single tag which corresponds to the very version of the package distributed as part of the Debian release. So much for the theory. Let's build, already!
Building the package (practise)
So we need a long-living integration branch, and that's easier done than said:
$ git checkout -b build mdadm-2.6.3+200709292116+4450e59
Now we're ready to build, and the following procedure should really be automated. I thus write it like a script, called poor-mans-gitbuild, which takes as optional argument the name of the (upstream) tag to use, defaulting to upstream (the tip):
#!/bin/sh
set -eu
git checkout master
debver=$(dpkg-parsechangelog | sed -ne 's,Version: ,,p')
git checkout build
git merge ${1:-upstream}
git merge upstream-patches
git merge master
for b in $(git for-each-ref --format='%(refname)' refs/heads/deb/*); do
  git merge -- $b
done
git tag -s debian/$debver
debuild   # will ignore .git automatically
git checkout master
Note how we are merging each branch in turn, instead of using the octopus merge strategy (which would create a commit with more than two parents) for reasons outlined in this post. An octopus-merge would actually work in our situation, but it will not always work, so better safe than sorry (although you could still achieve the same result). If you discover during the build that you forgot something, or the build script failed to run, just remove the tag, undo the merges, checkout the branch to which you need to commit to fix the issue, and then repeat the above build process:
$ git tag -d debian/$debver
$ git checkout build
$ git reset --hard upstream
$ git checkout master
$ editor debian/rules    # or whatever
$ git add debian/rules
$ git commit -s
$ poor-mans-gitbuild
Before you upload, it's a good idea to invoke gitk --all and verify that all goes according to plan:
screenshot of gitk after the above steps
When you're done and the package has been uploaded, push your work to git.debian.org, as before. Instead of using --all and --tags, I now specify exactly which refs to push. This is probably a good habit to get into to prevent publishing unwanted refs:
$ git push origin build tag debian/2.6.3+200709292116+4450e59-3
Now take your dog for a walk, or play outside, or do something else not involving a computer or entertainment device.
Uploading a new Debian version
If you are as lucky as I am, and the package you uploaded still has a bug in the upstream code and someone else fixes it before upstream releases a new version, then you might be in the position to release a new Debian version. Or maybe you just need to make some Debian-specific changes against the same upstream version. I'll let the commands speak for themselves:
$ git checkout upstream-patches
$ git-apply < patch-from-lunar.diff   #444682 again
$ git commit --author 'Jérémy Bobbio <lunar@debian.org>' -s
  # this should also be automated, see below
$ git checkout master
$ dch -i
$ dpkg-parsechangelog | sed -ne 's,Version: ,,p'
2.6.3+200709292116+4450e59-3
$ git commit -s debian/changelog
$ poor-mans-gitbuild
$ git push
$ git push origin tag debian/2.6.3+200709292116+4450e59-3
That first git-push may require a short explanation: without any arguments, git-push updates only the intersection of local and remote branches, so it would never push a new local branch (such as build above), but it updates all existing ones; thus, you cannot inadvertently publish a local branch. Tags still need to be published explicitly.
Hacking on the software
Imagine: on a rainy Saturday afternoon you get bored and decide to implement a better way to tell mdadm when to start which array. Since you're a genius, it'll take you only a day, but you do make mistakes here and there, so what could be better than to use version control? However, rather than having a branch that will live forever, you are just creating a local branch, which you will not publish. When you are done, you'll feed your work back into the existing branches. Git makes branching really easy and as you may have spotted, the poor-mans-gitbuild script reserves an entire branch namespace for people like you:
$ git checkout -b tmp/start-arrays-rework master
Unfortunately (or fortunately), fixing this issue will require work on two branches, since the initramfs script and hook are maintained in a separate branch. There are (again) two ways in which we can (sensibly) approach this:
  • create two separate, temporary branches, and switch between them as you work.
  • merge both into the temporary branch and later cherry-pick the commits into the appropriate branches.
I am undecided on this, but maybe the best would be a combination: merge both into a temporary branch and later cherry-pick the commits into two additional, temporary branches until you got it right, and then fast-forward the official branches to their tips:
$ git merge master deb/initramfs
$ editor debian/mdadm-raid                     #  
$ git commit -s debian/mdadm-raid
$ editor debian/initramfs/script.local-top     #  
$ git commit -s debian/initramfs/script.local-top
[many hours of iteration pass ]
[  until you are done]
$ git checkout -b tmp/start-arrays-rework-init master
  # for each commit $c in tmp/start-arrays-rework
  # applicable to the master branch:
$ git cherry-pick $c
$ git checkout -b tmp/start-arrays-rework-initramfs deb/initramfs
  # for each commit $c in tmp/start-arrays-rework
  # applicable to the deb/initramfs branch:
$ git cherry-pick $c
This is assuming that all your commits are logical units. If you find several commits which would better be bundled together into a single commit, this is the time to do it:
$ git cherry-pick --no-commit <commit7>
$ git cherry-pick --no-commit <commit4>
$ git cherry-pick --no-commit <commit5>
$ git commit -s
Before we now merge this into the official branches, let me briefly intervene and introduce the concept of a fast-forward. Git will "fast-forward" a branch to a new tip if it decides that no merge is needed. In the above example, we branched a temporary branch (T) off the tip of an official branch (O) and then worked on the temporary one. If we now merge the temporary one into the official one, Git determines that it can actually squash the ancestry into a single line and push the official branch tip to the same ref as the temporary branch tip. In cheap (poor man's) ASCII notation:
- - - O             >> merge T >>     - - - = - - OT
       \- - T       >>  into O >>
This works because no new commits have been made on top of O (if there were any, we might be able to rebase, but let's not go there quite yet; rebasing is how you shoot yourself in the foot with Git). Thus we can simply do the following:
$ git checkout deb/initramfs
$ git merge tmp/start-arrays-rework-initramfs
$ git checkout master
$ git merge tmp/start-arrays-rework-init
and test/build/push the result. Or well, since you are not an mdadm maintainer (We^W I have open job positions! Applications welcome!), you'll want to submit your work as patches via email:
$ git format-patch -s -M origin/master
This will create a number of files in the current directory, one for each commit you made since origin/master. Assuming each commit is a logical unit, you can now submit these to an email address. The --compose option lets you write an introductory message, which is optional:
$ git send-email --compose --to your@email.address <file1> <file2> < >
Once you've verified that everything is alright, swap your email address for the bug number (or the pkg-mdadm-devel list address). Thanks (in advance) for your contribution! Of course, you may also be working on a feature that you want to go upstream, in which case you'd probably branch off upstream-patches (if it depends on a patch not yet in upstream's repository), or upstream (if it does not):
$ git checkout -b tmp/cool-feature upstream
[ ]
When a new upstream version comes around
After a while, upstream may have integrated your patches, in addition to various other changes, to give birth to mdadm-2.6.4. We thus first fetch all the new refs and merge them into our upstream branch:
$ git fetch upstream-repo
$ git checkout upstream
$ git merge upstream-repo/master
we could just as well have executed git-pull, which with the default configuration would have done the same; however, I prefer to separate the process into fetching and merging. Now comes the point when many Git people think about rebasing. And in fact, rebasing is exactly what you should be doing, iff you're still working on an unpublished branch, such as the previous tmp/cool-feature off upstream. By rebasing your branch onto the updated upstream branch, you are making sure that your patch will apply cleanly when upstream tries it, because potential merge conflicts would be handled by you as part of the rebase, rather than by upstream:
$ git checkout tmp/cool-feature
$ git rebase upstream
What rebasing does is quite simple actually: it takes every commit you made since you branched off the parent branch and records the diff and commit message. Then, for each diff/commit_message pair, it creates a new commit on top of the new parent branch tip, thus rewrites history, and orphans all your original commits. Thus, you should only do this if your branch has never been published or else you would leave people who cloned from your published branch with orphans.
If this still does not make sense, try it out: create a (source) repository, make a commit (with a meaningful commit message), branch B off the tip, make a commit on top of B (with a meaningful message), clone that repository and return to the source repository. There, checkout the master, make a commit (with a ), checkout B, rebase it onto the tip of master, make a commit (with a ), and now git-pull from the clone; use gitk to figure out what's going on.
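A rough transcript of that experiment, in case you want to paste it into a terminal; repository names and commit messages here are made up, not from the text above:

$ mkdir source && cd source && git init
$ echo one > file && git add file && git commit -m 'commit on master'
$ git checkout -b B
$ echo two >> file && git commit -a -m 'commit on B'
$ cd .. && git clone source clone
$ cd source && git checkout master
$ echo three > other && git add other && git commit -m 'another commit on master'
$ git checkout B
$ git rebase master        # B's commit is rewritten on top of the new master tip
$ echo four >> file && git commit -a -m 'second commit on B'
$ cd ../clone
$ git pull                 # pulls the rewritten history: duplicate commits, possibly conflicts
$ gitk --all               # compare both repositories; the original commit on B is now orphaned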
So you should almost never rebase a published branch, and since all your branches outside of the tmp/* namespace are published on git.debian.org, you should not rebase those. But then again, Pierre actually rebases a published branch in his workflow, and he does so with reason: his patches branch is just a collection of branches to go upstream, from which upstream cherry-picks or which upstream merges, but which no one tracks (or should be tracking). But we can't (or at least will not at this point) do this for our feature branches (though we could treat upstream-patches that way), so we have to merge. At first, it suffices to merge the new upstream into the long-living build branch, and to call poor-mans-gitbuild, but if you run into merge conflicts or find that upstream's changes affect the functionality contained in your feature branches, you need to actually fix those. For instance, let's say that upstream started providing md.txt (which I previously provided in the deb/docs branch), then I need to fix that branch:
$ git checkout deb/docs
$ git rm md.txt
$ git commit -s
That was easy, since I could evade the conflict. But what if upstream made a change to the Makefile which got in the way of my configuration file location change? Then I'd have to merge upstream into deb/conffile-location, resolve the conflicts, and commit the change:
$ git checkout deb/conffile-location
$ git merge upstream
CONFLICT!
$ git-mergetool
$ git commit -s
When all conflicts have been resolved, I can prepare a new release, as before:
$ git checkout master
$ dch -i
$ dpkg-parsechangelog | sed -ne 's,Version: ,,p'
2.6.3+200709292116+4450e59-3
$ git commit -s debian/changelog
$ poor-mans-gitbuild
$ git push
$ git push origin tag debian/2.6.3+200709292116+4450e59-3
Note that Git often appears smart about commits that percolated upstream: since upstream included the two commits in upstream-patches in his 2.6.4 release, my upstream-patches branch got effectively annihilated, and Git was smart enough to figure that out without a conflict. But before you rejoice, let it be told that this does not always work.
Creating and using a maintenance branch
Let's say Debian "lenny" is released with mdadm 2.7.6-1, then:
$ git checkout -b maint/lenny debian/2.7.6-1
You might do this to celebrate the release, or you may wait until the need arises. We've already left the domain of reality ("lenny" is not yet released), so the following is just theory. Now, assume that a security bug is found in mdadm 2.7.6 after "lenny" was released. Upstream is already on mdadm 2.7.8 and commits deadbeef and c0ffee fix the security issue, then you'd cherry-pick them into the maint/lenny branch:
$ git checkout upstream
$ git pull
$ git checkout maint/lenny
$ git cherry-pick deadbeef
$ git cherry-pick c0ffee
If there are no merge conflicts (which you'd resolve with git-mergetool), we can just go ahead to prepare the new package:
$ dch -i
$ dpkg-parsechangelog | sed -ne 's,Version: ,,p'
2.7.6-1lenny1
$ git commit -s debian/changelog
$ poor-mans-gitbuild
$ git push origin maint/lenny
$ git push origin tag debian/2.7.6-1lenny1
Future directions
It should be trivial to create the Debian source package directly from the repository, and in fact, in response to a recent blog post of mine on the dispensability of pristine upstream tarballs, two people showed me their scripts to do it. My post also caused Joey Hess to clarify his position on pristine tarballs, before he went out to implement dpkg-source v3. This looks very promising. Yet, as Romain argues, there are benefits to simple patch management systems. Exciting times ahead! In addition to creating source packages from version control, a couple of other ideas have been around for a while:
  • create debian/changelog from commit log summaries when you merge into the build branch.
  • integrate version control with the BTS, bidirectionally:
    • given a bug report, create a temporary branch and apply any patches found in the bug report.
    • upon merging the temporary branch back into the feature branch it modifies, generate a patch, send it to the BTS and tag the bug report + pending patch.
And I am sure there are more. If you have any, I'd be interested to hear about them!
Wrapping up
I hope this post was useful. Thank you for reading to the end; this was probably my longest blog post ever. I want to thank Pierre Habouzit, Johannes Schindelin, and all the others on the #git/freenode IRC channel for their tutelage. Thanks also to Manoj Srivastava, whose pioneering work on packaging with GNU arch got me started on most of the concepts I use in the above. And of course, the members of the vcs-pkg mailing list for the various discussions on this subject, especially those who participated in the thread leading up to this post. Finally, thanks to Linus and Junio for Git and the continuously outstanding high level of support they give. If you are interested in the topic of using version control for distro packaging, I invite you to join the vcs-pkg mailing list and/or the #vcs-pkg/irc.oftc.net IRC channel. NP: Aphex Twin: Selected Ambient Works, Volume 2 (at least when I started writing )

28 September 2007

Martin F. Krafft: Counting developers

For my research I wanted to know how to obtain the exact number of Debian developers. Thanks to help from Andreas Barth and Manoj Srivastava, I can now document the procedure:
$ ldapsearch -xLLLH ldap://db.debian.org -b ou=users,dc=debian,dc=org \
  gidNumber=800 keyFingerPrint \
  | sed -rne ':s;/^dn:/bl;n;bs;:l;n;/^keyFingerPrint:/ p;bs ' \
  | wc -l
1049
This actually seems about right, as I do not recall any new maintainers being added since the last call for votes, which gives 1049 as well. Andreas told me to count the number of entries in LDAP with GID 800 and an associated key in the Debian keyring. Manoj's dvt-quorum script also takes the Debian keyrings (GPG and PGP) into account, so I did the same:
$ ldapsearch -xLLLH ldap://db.debian.org -b ou=users,dc=debian,dc=org \
  gidNumber=800 keyFingerPrint \
  | sed -rne ':s;/^dn:/bl;n;bs;
              :l;n;/^keyFingerPrint:/ s,keyFingerPrint: ,,p;bs ' \
  | sort -u > ldapfprs
$ rsync -az --progress \
  keyring.debian.org::keyrings/keyrings/debian-keyring.gpg \
  ./debian-keyring.gpg
$ gpg --homedir . --no-default-keyring --keyring debian-keyring.gpg \
  --no-options --always-trust --no-permission-warning \
  --no-auto-check-trustdb --armor --rfc1991 --fingerprint \
  --fast-list-mode --fixed-list-mode --with-colons --list-keys \
  | sed -rne 's,^fpr:::::::::([[:xdigit:]]+):,\1,p' \
  | sort -u > gpgfprs
$ rsync -az --progress \
  keyring.debian.org::keyrings/keyrings/debian-keyring.pgp \
  ./debian-keyring.pgp
$ gpg --homedir . --no-default-keyring --keyring debian-keyring.pgp \
  --no-options --always-trust --no-permission-warning \
  --no-auto-check-trustdb --armor --rfc1991 --fingerprint \
  --fast-list-mode --fixed-list-mode --list-keys \
  | sed -rne 's,^[[:space:]]+Key fingerprint = ,,;T;s,[[:space:]]+,,gp' \
  | sort -u > pgpfprs
$ sort ldapfprs pgpfprs gpgfprs | uniq -c \
  | egrep -c '^[[:space:]]+2[[:space:]]'
1048
MAN OVERBOARD! Who's the black sheep? Update: In the initial post, I forgot the option --fixed-list-mode and hit a minor bug in gnupg. I have since updated the above commands. Thus, there is no more black sheep, and the rest of this post only lingers here for posterity.
while read i; do
  grep "^$i$" pgpfprs gpgfprs || echo $i >&2
done < ldapfprs >/dev/null
which returns 9BF093BC475BABF8B6AEA5F6D7C3F131AB2A91F5
$ gpg --list-keys 9BF093BC475BABF8B6AEA5F6D7C3F131AB2A91F5
pub   4096R/AB2A91F5 2004-08-20
uid                  James Troup <james@nocrew.org>
our very own keyring master James Troup. So has James subverted the project? Is he actually not a Debian developer? Given the position(s) he holds, does that mean that the project is doomed? Ha! I am so tempted to end right here, but since my readers are used to getting all the facts, here's the deal: James is so special that he gets to be the only one to have a key in our GPG keyring which can be used for encryption, or so I found out as I was researching this. Now this bug in gnupg actually causes his fingerprint not to be printed. Until this is fixed (if ever), simply leave out --fast-list-mode in the above commands. NP: Oceansize: Effloresce

27 September 2007

Eddy Petrișor: Invariant sections...

Thanks to Holger's post I saw this really educational comic strip.

To those who don't get the joke: the invariant sections could put you in that exact position. Please read the chapter about "Invariant sections" from the "Draft Debian Position Statement about the GNU Free Documentation License (GFDL)".

21 August 2007

Manoj Srivastava: Arch Hook

All the version control systems I am familiar with run scripts on checkout and commit to take additional site-specific actions, and arch is no different. Well, actually, arch is perhaps different in the sense that arch runs a script on almost all actions, namely, the ~/.arch-params/hook script. Enough information is passed in to make this mechanism one of the most flexible I have had the pleasure to work with. In my hook script, I do the following things: I'd be happy to hear about what other people add to their commit scripts, to see if I have missed out on anything.
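For readers who have not used arch's hook, here is a minimal, hypothetical sketch of what such a script can look like. It assumes the action name is passed as the first argument and that revision details are available in ARCH_* environment variables; the exact variable names should be checked against the tla documentation, and the log file and notification address below are made up for the example.

#!/bin/sh
# Hypothetical ~/.arch-params/hook sketch, not Manoj's actual script.
# Assumption: the action name arrives as $1 and details such as
# $ARCH_ARCHIVE and $ARCH_REVISION are exported by tla (verify the
# variable names against your tla version's documentation).

ACTION=$1

case "$ACTION" in
    commit)
        # Keep a simple local log of every commit...
        printf '%s %s/%s\n' "$(date -R)" "$ARCH_ARCHIVE" "$ARCH_REVISION" \
            >> "$HOME/.arch-params/commit.log"
        # ...and send a notification mail (address is a placeholder).
        echo "committed $ARCH_ARCHIVE/$ARCH_REVISION" \
            | mail -s "arch commit: $ARCH_REVISION" commits@example.org
        ;;
    *)
        # Other actions can be dispatched here.
        ;;
esac

exit 0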

20 August 2007

Manoj Srivastava: Mail Filtering with CRM114: Part 4

Training the Discriminators

It has been a while since I posted in this category; actually, it has been a long while since my last blog post. When I last left you, I had mail (mbox format) folders called ham and/or junk, which were ready to be used for training either CRM114 or Spamassassin or both.

Setting up Spamassassin

This post lays the groundwork for the training, and details how things are set up. The first part is setting up Spamassassin. One of the things that bothered me about the default settings for Spamassassin was how swiftly Bayes information was expired; indeed, it seems really eager to dump the Bayes information (don't they trust their engine?). I have spent some effort building a large corpus, and keeping it clean, but Spamassassin would discard most of the information from the DB after training over my corpus, and the decrease in accuracy was palpable. To prevent this information from leaching away, I first increased the size of the database and turned off automatic expiration, by putting the following lines into ~/.spamassassin/user_prefs:
bayes_expiry_max_db_size  4000000
bayes_auto_expire         0

I also have regularly updated spam rules from the Spamassassin rules emporium to improve the efficiency of the rules; my current user_prefs is available as an example.

Initial training

I keep my Spam/Ham corpus under the directory /backup/classify/Done, in the subdirectories Ham and Spam. At the time of writing, I have approximately 20,000 mails in each of these subdirectories, for a total of 41,000+ emails. I have created a couple of scripts to train the discriminators from scratch using the extant corpus; these scripts are also used for re-learning, for instance, when I moved from a 32-bit machine to a 64-bit one, or when I change CRM114 discriminators. I generally run them from the ~/.spamassassin/ and ~/var/lib/crm114 (which contains my CRM114 setup) directories. I have found that training Spamassassin works best if you alternate Spam and Ham message chunks; this Spamassassin learning script delivers chunks of 50 messages for learning. With CRM114, I have discovered that it is not a good idea to stop learning based on the number of times the corpus has been gone over, since stopping before all messages in the corpus are correctly handled is also disastrous. So I set the repeat count to a ridiculously high number, and tell mailtrainer to continue training until a streak larger than the sum of Spam and Ham messages has occurred. This CRM114 trainer script does the job nicely; running it under screen is highly recommended.

Routine updates

Coming back to where we left off, we had mail (mbox format) folders called ham and/or junk sitting in the local mail delivery directory, which were ready to be used for training either CRM114 or Spamassassin or both. There are two scripts that help me automate the training. The first script, called mail-process, does most of the heavy lifting. It processes a bunch of mail folders, which are supposed to contain mail that is either all ham or all spam, as indicated by the command line arguments. We go looking through every mail, and for any mail where either the CRM114 or the Spamassassin judgement was not what we expected, we strip out the mail gathering headers, save the mail, one message to a file, and train the appropriate filter. This ensures that we only train on error, and it does not matter if we accidentally try to train on correctly classified mail, since that would be a no-op (apart from increasing the size of the corpus). The second script, called mproc, is a convenience front-end; it just calls mail-process with the proper command line arguments, feeding it the ham and junk folders in sequence, and takes no arguments itself. So, after human classification, just calling mproc takes care of the training. This pretty much finishes the series of posts I had in mind about spam filtering; I hope it has been useful.
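Since the mail-process and mproc scripts themselves are not shown here, the following is a rough, hypothetical sketch of the train-on-error loop described above. It assumes one-message-per-file input (split out beforehand, e.g. with formail), sa-learn from Spamassassin, and a CRM114 mailreaver.crm setup; the header names and training flags are illustrative and should be adapted to your own configuration.

#!/bin/sh
# Hypothetical train-on-error sketch, not the actual mail-process script.
# Usage: train-on-error.sh ham|spam message-file...
KIND=$1; shift
CRMDIR="$HOME/var/lib/crm114"     # assumed location of the CRM114 setup

for msg in "$@"; do
    # What did each filter think of this message? (header names illustrative)
    sa_verdict=$(grep -iq '^X-Spam-Flag: YES' "$msg" && echo spam || echo ham)
    crm_verdict=$(grep -iq '^X-CRM114-Status: SPAM' "$msg" && echo spam || echo ham)

    # Train each filter only when its judgement was wrong (train on error).
    if [ "$sa_verdict" != "$KIND" ]; then
        sa-learn --"$KIND" "$msg"
    fi
    if [ "$crm_verdict" != "$KIND" ]; then
        if [ "$KIND" = spam ]; then flag=--spam; else flag=--good; fi
        (cd "$CRMDIR" && crm mailreaver.crm "$flag" < "$msg")
    fi
done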

3 August 2007

Raphaël Hertzog: Thanks sam!

I really appreciated your last Bits of the DPL. I discovered a DPL taking a position on the hot topics of the moment. I’m glad to have a DPL who is trying to fulfill his duty of leading discussions amongst developers. He gave his opinion on the current vote about “endorsing the concept of Debian Maintainers” (he’s in favor because it dilutes power) and also about Apt’s change to install Recommends by default. I’m glad to hear the encouraging news concerning volunteers for ftpmasters. By the way, if you have voted for Sam, and if Sam’s opinion bears any importance for you, you still have until Saturday midnight (UTC) to change your vote if you wish to (like Russ did). Right now, only 289 DDs have voted.
